feat: add read_tabix and read_bam_references convenience functions #171

Closed
pkerpedjiev wants to merge 5 commits into abdenlab:main from pkerpedjiev:feat/tabix-bam-references

@pkerpedjiev
Collaborator

  • read_tabix: queries a BGZF tabix-indexed file for records in a genomic region, returning Arrow IPC bytes with chrom/start/end/raw columns. Accepts file paths and file-like objects; index can be a .tbi/.csi file path or file-like.

  • read_bam_references: reads reference sequence names and lengths from a BAM file header, returning Arrow IPC bytes with name/length columns. Useful for building chromsizes without scanning the full file.

These functions maintain backward compatibility with clients that relied on the same API in earlier oxbow versions (e.g. HiGlass/clodius).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@pkerpedjiev pkerpedjiev force-pushed the feat/tabix-bam-references branch from cf35486 to db333a7 Compare March 15, 2026 23:29
pkerpedjiev and others added 4 commits March 15, 2026 17:03
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@nvictus
Member

nvictus commented Mar 16, 2026

Thanks, @pkerpedjiev!

For the bam references, doesn't this do the job already? The same property exists for all data sources.

chromsizes = ox.from_bam(...).chrom_sizes

For tabix/CSI-indexed files, the BED datasource already does what you need (from_bed(..., bed_schema="bed3+", index="...")), as long as chrom, start, end are the first 3 columns as required by the BED standard. With the last PR you can also apply custom type parsing to extended BED columns, so that the parsing happens on the Rust side (bed_schema=("bed3", {"foo": "int", "bar": "string"})).

DataSources are the preferred API for oxbow, returning iterators that expose record batches from Rust to Python with zero copy.

If I were to add something, I do think that we need a more generic BED-like TSV reader for the tabix use case where chrom, start, end are not the first 3 fields, as tabix allows this.
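To make the generic case concrete, here is a minimal pure-Python sketch of a BED-like TSV region filter in which the chrom, start, and end columns are configurable, in the spirit of tabix's -s/-b/-e options. It is not oxbow code and does not use an index: the names iter_region, Record, seq_col, begin_col, and end_col are hypothetical, and the sketch assumes 0-based, half-open coordinates as in BED.

```python
import io
from typing import Iterable, Iterator, NamedTuple


class Record(NamedTuple):
    chrom: str
    start: int
    end: int
    raw: str  # the full, unparsed line


def iter_region(
    lines: Iterable[str],
    chrom: str,
    start: int,
    end: int,
    *,
    seq_col: int = 1,    # 1-based column numbers, like tabix -s/-b/-e
    begin_col: int = 2,
    end_col: int = 3,
    comment: str = "#",
) -> Iterator[Record]:
    """Yield records overlapping [start, end) on chrom (linear scan, no index)."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith(comment):
            continue  # skip blank lines and header/comment lines
        fields = line.split("\t")
        rec_chrom = fields[seq_col - 1]
        rec_start = int(fields[begin_col - 1])
        rec_end = int(fields[end_col - 1])
        # Half-open interval overlap test (assumes BED-style coordinates).
        if rec_chrom == chrom and rec_start < end and rec_end > start:
            yield Record(rec_chrom, rec_start, rec_end, line)


# Hypothetical file whose coordinates live in columns 2-4 rather than 1-3.
text = (
    "#name\tchrom\tstart\tend\n"
    "a\tchr1\t100\t200\n"
    "b\tchr1\t300\t400\n"
    "c\tchr2\t150\t250\n"
)
hits = list(
    iter_region(io.StringIO(text), "chr1", 150, 350,
                seq_col=2, begin_col=3, end_col=4)
)
```

A real reader would use the .tbi/.csi index to seek to the relevant BGZF blocks instead of scanning every line; the sketch only shows the column-mapping part of the problem.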

@pkerpedjiev
Collaborator Author

Yeah, I think I can use both of those. I'll try to update the clodius PR with those changes, and I'll reopen and modify this PR if that doesn't work.
